Instance selection for big data based on locally sensitive hashing and double-voting mechanism

Abstract

Increasing data volumes impose unprecedented challenges on traditional data mining in preprocessing, learning, and analysis, and the design of efficient compressing, indexing, and searching methods has recently attracted much attention. Inspired by locally sensitive hashing (LSH), the divide-and-conquer strategy, and a double-voting mechanism, we propose an iterative instance selection algorithm that runs for p rounds to iteratively reduce or eliminate the unwanted bias of the optimal solution through double voting. In each iteration, the algorithm partitions the big dataset into several subsets and distributes them to different computing nodes. On each node, the instances in the local subset are transformed into Hamming space by l hash functions in parallel, each instance is assigned to one of the l hash tables according to its corresponding hash code, and instances with the same hash code are put into the same bucket. Then a proportion of instances is randomly selected from each bucket of each hash table, and a subset is obtained. In total, l subsets are obtained, which are used for voting to select the locally optimal instance subset. The process is repeated p times to obtain p subsets. Finally, the globally optimal instance subset is obtained by voting among these p subsets. The algorithm is implemented on two open source big data platforms, Hadoop and Spark, and experimentally compared with three state-of-the-art methods on testing accuracy, compression ratio, and running time. The experimental results demonstrate that the proposed algorithm provides excellent performance and outperforms the three baseline methods.
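The steps in the abstract can be illustrated with a small single-machine sketch. This is not the authors' implementation: the abstract names neither the hash family nor the voting rule, so the sketch assumes sign-random-projection hashing to produce binary codes and a simple majority count for both votes. All function names and parameters (`lsh_bucket_sample`, `vote`, `select_instances`, `n_bits`, `n_parts`) are hypothetical, and the loop over partitions only stands in for the distribution to Hadoop or Spark nodes.

```python
from collections import defaultdict

import numpy as np


def lsh_bucket_sample(X, n_bits, proportion, rng):
    """Build one hash table and randomly keep a proportion of each bucket."""
    # Assumed hash family: sign of random projections gives a binary code.
    planes = rng.standard_normal((X.shape[1], n_bits))
    codes = (X @ planes) > 0
    buckets = defaultdict(list)
    for idx, code in enumerate(codes):
        buckets[code.tobytes()].append(idx)  # same code -> same bucket
    selected = set()
    for members in buckets.values():
        k = max(1, int(round(proportion * len(members))))
        selected.update(rng.choice(members, size=k, replace=False).tolist())
    return selected


def vote(subsets, min_votes):
    """Assumed voting rule: keep indices that appear in >= min_votes subsets."""
    counts = defaultdict(int)
    for subset in subsets:
        for idx in subset:
            counts[idx] += 1
    return {idx for idx, c in counts.items() if c >= min_votes}


def select_instances(X, p=3, l=5, n_bits=4, proportion=0.5, n_parts=2, seed=0):
    """Run p rounds; each round votes over l tables, then rounds vote again."""
    rng = np.random.default_rng(seed)
    round_results = []
    for _ in range(p):
        perm = rng.permutation(len(X))
        local_winners = set()
        # Each partition plays the role of one computing node.
        for part in np.array_split(perm, n_parts):
            tables = [
                lsh_bucket_sample(X[part], n_bits, proportion, rng)
                for _ in range(l)
            ]
            winners = vote(tables, min_votes=l // 2 + 1)  # first vote
            local_winners.update(int(part[i]) for i in winners)
        round_results.append(local_winners)
    # Second vote: across the p per-round subsets.
    return sorted(vote(round_results, min_votes=p // 2 + 1))


if __name__ == "__main__":
    data = np.random.default_rng(1).standard_normal((200, 8))
    kept = select_instances(data)
    print(f"kept {len(kept)} of {len(data)} instances")
```

Sampling inside each bucket, rather than over the whole subset, keeps at least one representative from every region of the hashed space, which is the usual motivation for combining LSH with random selection. The majority thresholds are a design choice of this sketch and can be loosened or tightened to trade compression ratio against coverage.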

Related resources

Instance-Based Matching of Large Ontologies Using Locality-Sensitive Hashing

In this paper, we describe a mechanism for ontology alignment using instance based matching of types (or classes). Instance-based matching is known to be a useful technique for matching ontologies that have different names and different structures. A key problem in instance matching of types, however, is scaling the matching algorithm to (a) handle types with a large number of instances, and (b...

Instance selection of linear complexity for big data

Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large. However, even these methods face similar problems with very large-to-massive data sets. In this ...

A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection

Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....

Voting-based instance selection from large data sets with MapReduce and random weight networks

Instance selection is an important preprocessing step in machine learning. By choosing a subset of a data set, it achieves the same performance of a machine learning algorithm as if the whole data set is used, and it enables a machine learning algorithm to be feasible for and to work effectively with large data sets. Based on voting mechanism, this paper proposes a large data sets instance sele...

Probabilistic Hashing Techniques for Big Data

Journal

Journal title: Advances in Computational Intelligence

Year: 2022

ISSN: 2730-7808, 2730-7794

DOI: https://doi.org/10.1007/s43674-022-00033-z